A Heuristic Method for Chinese Segmentation

Authors

  • Christopher C. Yang
  • Johnny W.K. Luk
Abstract

Research and development in digital libraries includes content creation, conversion, indexing, organization, and dissemination, where the key technological issue is how to search and display desired selections from and across large collections effectively [10]. A repository is an indexed collection of objects. Indexing is essential to searching: the better the indexing, the better the search results. The smallest indexing units in a Chinese digital library are words, whereas the smallest units in a Chinese sentence are characters. However, unlike English text, Chinese text has no delimiters to mark word boundaries. In English and other languages using Roman- or Greek-based orthographies, spacing often reliably indicates word boundaries. In Chinese, characters are written in sequence without any delimiters indicating the boundaries between consecutive words. Previous work on Chinese word segmentation can be divided into three categories: (a) statistical approaches, (b) lexical rule-based approaches, and (c) hybrid approaches based on statistical and lexical information. In this paper, we focus on the statistical approach. In particular, we investigate a heuristic approach based on significance estimation for Chinese segmentation. Experiments are conducted to evaluate its performance. The results of the heuristic approach are compared with those of the boundary detection approach. The heuristic approach shows improvement in segmenting unknown words.
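
To make the statistical idea concrete, the following minimal sketch groups adjacent characters whose association in a corpus is strong and places a word boundary where it is weak. It is an illustration only: the abstract does not define the paper's significance estimation, so the sketch substitutes pointwise mutual information between adjacent characters as the association measure. The function names (`train_counts`, `pmi`, `segment`), the `threshold` parameter, and the toy corpus are assumptions introduced here and do not come from the paper.

```python
import math
from collections import Counter


def train_counts(corpus):
    """Count single characters and adjacent character pairs in a list of strings."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        unigrams.update(sentence)
        bigrams.update(sentence[i:i + 2] for i in range(len(sentence) - 1))
    return unigrams, bigrams


def pmi(a, b, unigrams, bigrams):
    """Pointwise mutual information (base 2) of the adjacent pair a+b.

    Returns -inf for a pair never seen in the corpus.
    """
    joint = bigrams[a + b]
    if joint == 0:
        return float("-inf")
    p_ab = joint / sum(bigrams.values())
    p_a = unigrams[a] / sum(unigrams.values())
    p_b = unigrams[b] / sum(unigrams.values())
    return math.log2(p_ab / (p_a * p_b))


def segment(text, unigrams, bigrams, threshold=0.0):
    """Join adjacent characters whose PMI exceeds the (assumed) threshold."""
    if not text:
        return []
    words = [text[0]]
    for prev, cur in zip(text, text[1:]):
        if pmi(prev, cur, unigrams, bigrams) > threshold:
            words[-1] += cur      # strong association: extend the current word
        else:
            words.append(cur)     # weak association: start a new word
    return words


# Toy corpus (hypothetical data, for illustration only).
toy_corpus = ["图书", "图书", "很大", "很大"]
unigrams, bigrams = train_counts(toy_corpus)
print(segment("图书很大", unigrams, bigrams))  # ['图书', '很大']
```

A real system would estimate the counts from a large corpus and tune the threshold; a single pairwise score like this cannot by itself resolve segmentation ambiguities or longer unknown words.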

Similar resources

A heuristic method based on a statistical approach for Chinese text segmentation

The authors propose a heuristic method for automatic Chinese text segmentation based on a statistical approach. The method relies on statistical information about the association among adjacent characters in Chinese text. Mutual information of bi-grams and significance estimation of tri-grams are utilized. A heuristic method with six rules is then proposed to determine the segmentat...

Segmenting Chinese Unknown Words by Heuristic Method

Chinese text segmentation is important in Chinese text indexing. Due to the lack of word delimiters in Chinese text, Chinese text segmentation is more difficult than English text segmentation. Besides, the segmentation ambiguities and the occurrences of out-of-vocabulary words (i.e. unknown words) are the major challenges in Chinese segmentation. Many research works dealing with the problem of ...

A Chinese-English Organization Name Translation System Using Heuristic Web Mining and Asymmetric Alignment

In this paper, we propose a novel system for translating organization names from Chinese to English with the assistance of web resources. First, we adopt a chunking-based segmentation method to improve the segmentation of Chinese organization names, which is plagued by the OOV problem. Then a heuristic query construction method is employed to construct an efficient query which can be used to se...

Design of CKIP Chinese Word Segmentation System

In this paper, we describe the design of the CKIP Chinese word segmentation system and analyse its performance. The system uses a modularized approach. Independent modules were designed to resolve segmentation ambiguities and to identify unknown words. Segmentation ambiguities are resolved by a hybrid method using heuristic and statistical rules. Regular-type unknown words ar...

Realignment from Finer-grained Alignment to Coarser-grained Alignment to Enhance Mongolian-Chinese SMT

The conventional Mongolian-Chinese statistical machine translation (SMT) model uses Mongolian words and Chinese words to train the system. However, data sparsity, complex Mongolian morphology, and Chinese word segmentation (CWS) errors lead to alignment errors and ambiguities. Some other works use finer-grained Mongolian stems and Chinese characters, which suffer from information loss when in...

Publication date: 2000